A Corpus-Based Study of Phoneme Distribution in Thai

Authors

  • Adirek Munthuli
  • Ploypailin Sirimujalin
  • Charturong Tantibundhit
  • Krit Kosawat
  • Chutamanee Onsuwan
Abstract

This paper presents the steps taken to obtain Thai phoneme distributions from large-scale written Thai corpora. The data come from 12 text genres in InterBEST [1], considered the largest Thai corpus. Each word was transliterated using grapheme-to-phoneme software [2]. Word frequencies, the frequencies of 81 Thai phonemes in each genre, and the 95% confidence intervals (CIs) of the average occurrence of each phoneme were then calculated. For each genre, the phonemes that did not fall within the 95% CI were counted. As a result, 3 genres whose distributions were highly incompatible with the others were removed, leaving approximately 80% of the data. Finally, we obtained the phoneme distributions of initials, finals, vowels, and tones. Importantly, four bigram frequencies (final-to-tone, vowel-to-final, initial-to-tone, and initial-to-vowel) and one trigram frequency (vowel-final-tone) are also given.
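The genre-screening step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy phoneme labels, and the use of a normal-approximation critical value (1.96) are all assumptions, and the abstract does not say how many out-of-interval phonemes lead to a genre being removed.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

# Assumption: normal approximation for the 95% CI; the abstract does not
# state which critical value (z or t) was actually used.
Z_95 = 1.96


def relative_frequencies(phonemes):
    """Map each phoneme to its share of all phoneme tokens in one genre."""
    counts = Counter(phonemes)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def outliers_per_genre(genre_phonemes):
    """Count, per genre, the phonemes whose relative frequency lies outside
    the 95% CI of that phoneme's mean frequency across all genres.

    genre_phonemes: dict mapping a genre name to a list of phoneme tokens
    (hypothetical input format; at least two genres are required).
    """
    freqs = {g: relative_frequencies(seq) for g, seq in genre_phonemes.items()}
    inventory = sorted({p for f in freqs.values() for p in f})
    n = len(freqs)
    outliers = {g: 0 for g in freqs}
    for p in inventory:
        # A phoneme absent from a genre counts as frequency 0 there.
        values = [freqs[g].get(p, 0.0) for g in freqs]
        centre = mean(values)
        half_width = Z_95 * stdev(values) / sqrt(n)
        for g in freqs:
            v = freqs[g].get(p, 0.0)
            if not (centre - half_width <= v <= centre + half_width):
                outliers[g] += 1
    return outliers
```

Genres with unusually high outlier counts would then be candidates for removal before the final distributions are computed; the threshold for "unusually high" is left open here because the abstract does not give one.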

Similar resources

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

Improving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM

Improving phoneme recognition has attracted the attention of many researchers due to its applications in various fields of speech processing. Recent research shows that using deep neural networks (DNNs) in speech recognition systems significantly improves their performance. DNN-based phoneme recognition systems have two phases: training and testing. Mos...

Example-based grapheme-to-phoneme conversion for Thai

Several characteristics of the Thai writing system make Thai grapheme-to-phoneme (G2P) conversion very challenging. In this paper, we propose an Example-Based Grapheme-to-Phoneme conversion approach. It generates the pronunciation of a word by selecting, modifying, and combining pronunciations of syllables from the training corpus. The best system achieves 80.99% word accuracy and 94.19% phone accu...

Frequency of occurrence of phonemes and syllables in Thai: Analysis of spoken and written corpora

This work provides detailed frequencies and distributions of Thai phonemes, biphones, and syllable types drawn from three large-scale Thai corpora (InterBEST, LOTUS-BN, and LOTUS-Cell 2.0). Comparisons are carried out to examine the extent to which linguistic variation associated with different corpus types (written vs. spoken) affects frequency statistics and distribution patterns. Results and s...

Publication date: 2013